2.2 DataFrame

Content:

2.2.1 DataFrame Structure
2.2.2 Working with Columns
2.2.3 Working with Rows
2.2.4 Conditional Selection
2.2.5 Case Study: Olympic Games

A DataFrame is a two dimensional data structure with columns of potentially different data types. It can be considered like a table or a spreadsheet, similar to R's data.frame.

A DataFrame has both row index and column index.



In [1]:

    
import numpy as np
import pandas as pd

2.1.1 DataFrame Structure

Initializing a Dataframe.



In [2]:

    
# Create a DataFrame from dictionary
data = {'Region':['Central','East','North','North-East','West'],
        'Area':[132.7,93.1,134.5,103.9,201.3],
       'Population':[939890,693500,531860,834450,903010]}
df = pd.DataFrame(data)
# Display df
df



In [3]:

    
# Note that the row index are assigned automatically and column index are arranged in alphabetical order.
# Rearrange columns
df = pd.DataFrame(data,columns=['Region','Population','Area'])
df



In [4]:

    
# Display columns names
df.columns









    Out[4]:





Index([u'Region', u'Population', u'Area'], dtype='object')



In [5]:

    
# Display all values
df.values









    Out[5]:





array([['Central', 939890L, 132.7],
       ['East', 693500L, 93.1],
       ['North', 531860L, 134.5],
       ['North-East', 834450L, 103.9],
       ['West', 903010L, 201.3]], dtype=object)



In [6]:

    
# Get the number of rows and columns of the dataframe i.e. its shape
df.shape









    Out[6]:





(5, 3)



In [7]:

    
# Size of DataFrame = row x column
df.size









    Out[7]:





15



In [8]:

    
# Number of rows
len(df)









    Out[8]:





5



In [11]:

    
# Programming specific information of the dataframe
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
Region        5 non-null object
Population    5 non-null int64
Area          5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 192.0+ bytes



In [9]:

    
# Statistical description of numerical columns
df.describe()









    Out[9]:







  
    
      
      Population
      Area
    
  
  
    
      count
      5.00000
      5.000000
    
    
      mean
      780542.00000
      133.100000
    
    
      std
      167865.50828
      42.148547
    
    
      min
      531860.00000
      93.100000
    
    
      25%
      693500.00000
      103.900000
    
    
      50%
      834450.00000
      132.700000
    
    
      75%
      903010.00000
      134.500000
    
    
      max
      939890.00000
      201.300000

2.2.2 Working with columns



In [12]:

    
# Rename a column label
df = df.rename(columns={'Population':'Pop'})
df



In [13]:

    
# Select a single column to series
A = df['Area'] # same answer as df.Area
A









    Out[13]:





0    132.7
1     93.1
2    134.5
3    103.9
4    201.3
Name: Area, dtype: float64



In [14]:

    
# Select a single column to dataframe
B = df[['Area']]
B



In [15]:

    
# Select multiple columns to dataframe
C = df[['Area','Pop']]
C



In [16]:

    
# Change order of columns
D = df[['Region','Area','Pop']]
D



In [17]:

    
# Drop a column by label
E = df.drop('Area',axis=1)
E



In [21]:

    
# Create a new column 'Density' = 'Population'/'Area'
df['Density'] = df['Pop']/df['Area']
df









    Out[21]:







  
    
      
      Region
      Pop
      Area
      Density
    
  
  
    
      0
      Central
      939890
      132.7
      7082.818387
    
    
      1
      East
      693500
      93.1
      7448.979592
    
    
      2
      North
      531860
      134.5
      3954.349442
    
    
      3
      North-East
      834450
      103.9
      8031.280077
    
    
      4
      West
      903010
      201.3
      4485.891704



In [22]:

    
# Sort values
df.sort_values(by=['Pop'], ascending=False)









    Out[22]:







  
    
      
      Region
      Pop
      Area
      Density
    
  
  
    
      0
      Central
      939890
      132.7
      7082.818387
    
    
      4
      West
      903010
      201.3
      4485.891704
    
    
      3
      North-East
      834450
      103.9
      8031.280077
    
    
      1
      East
      693500
      93.1
      7448.979592
    
    
      2
      North
      531860
      134.5
      3954.349442



In [23]:

    
# Find index label for max/min values
df['Density'].idxmax()









    Out[23]:





3

Exercise: Which region has the highest density? Can you get the answer without sorting?



In [23]:

    
# method 1: 
df['Region'][df['Density'].idxmax()]









    Out[23]:





'North-East'



In [24]:

    
# method 2: Change the index
df1 = df.set_index('Region')
df1['Density'].idxmax()









    Out[24]:





'North-East'

There are many commonly used column-wide methods/attributes:

df['col'].size
df['col'].count()
df['col'].sum()
df['col'].max()
df['col'].mean()
df['col'].std()



In [25]:

    
# Get all numerical summaries of a column
df['Pop'].describe()









    Out[25]:





count         5.00000
mean     780542.00000
std      167865.50828
min      531860.00000
25%      693500.00000
50%      834450.00000
75%      903010.00000
max      939890.00000
Name: Pop, dtype: float64

2.2.3 Working with rows



In [26]:

    
# Select multiple rows
df[2:4]



In [27]:

    
# Select the last row
df[-1:]









    Out[27]:







  
    
      
      Region
      Pop
      Area
      Density
    
  
  
    
      4
      West
      903010
      201.3
      4485.89



In [28]:

    
# Select all but last row
df[:-1]



In [29]:

    
# Select all even rows
df[::2]



In [30]:

    
# Select by .iloc
df.iloc[0:2,1:3]



In [31]:

    
# Select by .loc
df.loc[0:1,['Pop','Area']]

2.2.4 Conditional Selection



In [24]:

    
# Boolean masking
df['Pop']>800000









    Out[24]:





0     True
1    False
2    False
3     True
4     True
Name: Pop, dtype: bool



In [25]:

    
# Select rows by boolean masking
df[df['Pop']>800000]









    Out[25]:







  
    
      
      Region
      Pop
      Area
      Density
    
  
  
    
      0
      Central
      939890
      132.7
      7082.818387
    
    
      3
      North-East
      834450
      103.9
      8031.280077
    
    
      4
      West
      903010
      201.3
      4485.891704



In [26]:

    
# Boolean masking with ==
df['Region']=='Central'









    Out[26]:





0     True
1    False
2    False
3    False
4    False
Name: Region, dtype: bool



In [27]:

    
# Select rows by boolean masking
df[df['Region']=='Central']









    Out[27]:







  
    
      
      Region
      Pop
      Area
      Density
    
  
  
    
      0
      Central
      939890
      132.7
      7082.818387



In [28]:

    
# Using .loc to find the Area of the Central region.
df.loc[df['Region']=='Central', 'Area']









    Out[28]:





0    132.7
Name: Area, dtype: float64



In [29]:

    
# Multiple conditions (and: &) (or: |)
(df['Pop'] < 800000) & (df['Density']<8000)









    Out[29]:





0    False
1     True
2     True
3    False
4    False
dtype: bool



In [30]:

    
# Select rows by multiple conditions
df[(df['Pop'] < 800000) & (df['Density']<8000)]









    Out[30]:







  
    
      
      Region
      Pop
      Area
      Density
    
  
  
    
      1
      East
      693500
      93.1
      7448.979592
    
    
      2
      North
      531860
      134.5
      3954.349442



In [31]:

    
# Using .query method

df.query("Pop < 800000 & Density < 8000")









    Out[31]:







  
    
      
      Region
      Pop
      Area
      Density
    
  
  
    
      1
      East
      693500
      93.1
      7448.979592
    
    
      2
      North
      531860
      134.5
      3954.349442

2.2.5 Case Study: Olympic Games

We can import data from a csv file by using pd.read_csv. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

We can import data from an Excel file by using pd.read_excel. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Source of data: https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table



In [32]:

    
# Import data from Excel file
og = pd.read_excel('OlympicGames.xlsx')
# Display first 5 rows of dataframe
og.head()









    Out[32]:







  
    
      
      Country
      Games
      Gold
      Silver
      Bronze
    
  
  
    
      0
      Afghanistan
      14
      0
      0
      2
    
    
      1
      Algeria
      13
      5
      4
      8
    
    
      2
      Argentina
      24
      21
      25
      28
    
    
      3
      Armenia
      6
      2
      5
      7
    
    
      4
      Australasia
      2
      3
      4
      5

Q1: How many rows and columns are there in the dataframe?



In [33]:

    
og.shape









    Out[33]:





(151, 5)

Q2: Are there any missing values in the dataframe?



In [34]:

    
#og.isnull().sum()
og.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 151 entries, 0 to 150
Data columns (total 5 columns):
Country    151 non-null object
Games      151 non-null int64
Gold       151 non-null int64
Silver     151 non-null int64
Bronze     151 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.1+ KB

Q3: Create a new column for total number of Olympic medals.



In [35]:

    
og['Total'] = og['Gold']+og['Silver']+og['Bronze']
og.head()









    Out[35]:







  
    
      
      Country
      Games
      Gold
      Silver
      Bronze
      Total
    
  
  
    
      0
      Afghanistan
      14
      0
      0
      2
      2
    
    
      1
      Algeria
      13
      5
      4
      8
      17
    
    
      2
      Argentina
      24
      21
      25
      28
      74
    
    
      3
      Armenia
      6
      2
      5
      7
      14
    
    
      4
      Australasia
      2
      3
      4
      5
      12

Q4: Select the row where Country = Singapore.



In [36]:

    
og[og['Country']=='Singapore']









    Out[36]:







  
    
      
      Country
      Games
      Gold
      Silver
      Bronze
      Total
    
  
  
    
      116
      Singapore
      16
      1
      2
      2
      5

Q5: Which country has won the highest number of Gold medals?



In [38]:

    
#og[og['Gold'] == og['Gold'].max()]['Country']
og['Country'][og['Gold'].idxmax()]









    Out[38]:





u'United States'

Q6: How many countries participated in at least 25 Olympic games (100 years)? Return an integer.



In [40]:

    
len(og.query("Games >= 25").loc[:, "Country"])









    Out[40]:





16

Q7: Which are the top 3 countries with highest total Olympic medals?

Challenge! Can you return a list of countries in one line of code?



In [48]:

    
og.sort_values(by='Total', ascending=False)[0:3]['Country'].values









    Out[48]:





array(['United States', 'Soviet Union', 'Great Britain'], dtype=object)

Q8: Out of the countries which have not won any Gold medals, which country has won the highest number of medals?

Challenge! Can you return the name of the country in one line of code?



In [42]:

    
og['Country'][og[og['Gold']==0]['Total'].idxmax()]
og.Country[og.query("Gold == 0").Total.idxmax()]









    Out[42]:





u'Malaysia'



In [ ]:

	Area	Population	Region
0	132.7	939890	Central
1	93.1	693500	East
2	134.5	531860	North
3	103.9	834450	North-East
4	201.3	903010	West

	Population	Area
count	5.00000	5.000000
mean	780542.00000	133.100000
std	167865.50828	42.148547
min	531860.00000	93.100000
25%	693500.00000	103.900000
50%	834450.00000	132.700000
75%	903010.00000	134.500000
max	939890.00000	201.300000

	Country	Games	Gold	Silver	Bronze
0	Afghanistan	14	0	0	2
1	Algeria	13	5	4	8
2	Argentina	24	21	25	28
3	Armenia	6	2	5	7
4	Australasia	2	3	4	5